Steve Elston
01/13/2021
The general process of hypothesis testing can be described as:
Compare the p-value to the cutoff selected
There are only two possible conclusions
We reject the null hypothesis at significance level alpha (the p-value is below the cutoff)
We cannot reject the null hypothesis since either, i) the null hypothesis is true and there is no difference between the data distribution and the null distribution, or ii) there is insufficient evidence given the effect size to reject the null hypothesis (insufficient power)
Beware of multiple hypothesis tests:
High error rate
Can end up ‘p-value fishing’
For a one-tailed hypothesis test
There are a great many tests to choose from
| Data | Hypothesis | Distribution | Test |
|---|---|---|---|
| Two samples of continuous variable | Difference of means | t | t-test |
| Counts for different categories | Counts are different | \(\chi^2\) | Pearson’s \(\chi^2\) |
| Multiple (groups) of continuous variables | Distribution of groups are different | \(F = \frac{between\ group\ variance}{within\ group\ variance}\) | ANOVA |
| Sample from a distribution | Variable has the hypothesized distribution | Kolmogorov-Smirnov | Kolmogorov-Smirnov |
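As a sketch of how each row of this table maps onto code (the simulated data and parameters below are illustrative, not from the lecture), `scipy.stats` provides all four tests:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Two-sample t-test: difference of means of two continuous samples
a = rng.normal(loc=0.0, scale=1.0, size=100)
b = rng.normal(loc=0.5, scale=1.0, size=100)
t_stat, t_p = stats.ttest_ind(a, b)

# Pearson's chi-squared test: do observed counts differ from expected counts?
counts = np.array([25, 30, 45])
chi2_stat, chi2_p = stats.chisquare(counts)   # expected counts uniform by default

# One-way ANOVA: do several groups of a continuous variable share one mean?
g1, g2, g3 = rng.normal(0, 1, 50), rng.normal(0, 1, 50), rng.normal(1, 1, 50)
f_stat, anova_p = stats.f_oneway(g1, g2, g3)

# Kolmogorov-Smirnov: does the sample follow a hypothesized distribution?
ks_stat, ks_p = stats.kstest(a, "norm")

print(t_p, chi2_p, anova_p, ks_p)
```

Each call returns a test statistic and a p-value to compare against the chosen cutoff.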
Despite their long history, Bayesian models were not used extensively until recently
Bayesian analysis is a contrast to frequentist methods
The objective of Bayesian analysis is to compute a posterior distribution
This contrasts with frequentist statistics, which computes a point estimate and confidence interval from a sample
Bayesian models allow expressing prior information in the form of a prior distribution
Selection of prior distributions can be performed in a number of ways
The posterior distribution is said to quantify our current belief
We update beliefs based on additional data or evidence
This is a critical difference from frequentist models, which must be computed from a complete sample
Inference can be performed on the posterior distribution by finding the maximum a posteriori (MAP) value and a credible interval
Bayesian methods made global headlines with the successful location of the missing Air France Flight 447
The aircraft had disappeared in a little-traveled area of the Atlantic Ocean
Conventional location methods had failed to locate the wreckage; the potential search area was too large
Bayesian methods rapidly narrowed the prospective search area
Posterior distribution of locations of Air France 447
With greater computational power and general acceptance, Bayes methods are now widely used
Among pragmatists, Bayesian models are attractive because they allow us to express prior information
Models that fall between the frequentist and Bayesian extremes are also in common use
We can compare the contrasting frequentist and Bayesian approaches
Comparison of frequentist and Bayes methods
Bayes’ Theorem is fundamental to Bayesian data analysis.
\[P(A \cap B) = P(A|B) P(B) \]
We can also write:
\[P(A \cap B) = P(B|A) P(A) \]
Eliminating \(P(A \cap B):\)
\[ P(B)P(A|B) = P(A)P(B|A)\]
Or, Bayes theorem!
\[P(A|B) = \frac{P(B|A)P(A)}{P(B)}\]
In many cases we are interested in the marginal distribution
\[p(\theta_1) = \int_{\theta_2, \ldots, \theta_n} p(\theta_1, \theta_2, \ldots, \theta_n)\ d\theta_2, \ldots, d\theta_n\] - But computing this integral is not easy!
For discrete distributions compute the marginal by summation
Or, for discrete samples of a continuous distribution
Example, we know posterior distribution of a parameter \(\theta\) but really want marginal distribution of the parameter value
\[ p(\theta) = \sum_{x \in \mathbf{X}} p(\theta | x)\ p(x) \]
Now we have the marginal distribution of \(\theta\)
Or, we need to find the denominator for Bayes theorem to normalize our posterior distribution:
\[ p(\mathbf{X}) = \sum_{\theta \in \Theta} p(\mathbf{X} |\theta) p(\theta) \]
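A minimal sketch of this summation, assuming an illustrative discrete set of candidate coin biases (not an example from the lecture):

```python
import numpy as np

# Discrete candidate values for a coin's success probability theta
thetas = np.array([0.2, 0.5, 0.8])
prior = np.array([1/3, 1/3, 1/3])                # uniform prior over candidates

# Evidence: 7 successes in 10 Bernoulli trials
z, n = 7, 10
likelihood = thetas**z * (1 - thetas)**(n - z)   # Binomial likelihood (up to a constant)

# Denominator of Bayes theorem: p(X) = sum over theta of p(X | theta) p(theta)
evidence = np.sum(likelihood * prior)

posterior = likelihood * prior / evidence
assert np.isclose(posterior.sum(), 1.0)          # normalization worked
```

With 7 successes in 10 trials, most of the posterior probability lands on the candidate theta = 0.8.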
How can you interpret Bayes’ Theorem?
\[Posterior\ Distribution = \frac{Likelihood \bullet Prior\ Distribution}{Evidence} \]
\[ posterior\ distribution(parameters\ |\ data) = \\ \frac{Likelihood(data\ |\ parameters)\ Prior(parameters)}{P(data)} \]
\[ P(parameters\ |\ data) = \frac{P(data\ |\ parameters)\ P(parameters)}{P(data)} \]
What do these terms actually mean?
Posterior distribution of the parameters given the evidence or data, the goal of Bayesian analysis
Prior distribution is chosen to express information available about the model parameters a priori
Likelihood is the conditional distribution of the data given the model parameters
Data or evidence is the distribution of the data and normalizes the posterior
These relationships apply to any parameters in a model: partial slopes, intercept, error distributions, lasso constants, etc.
We need a tractable formulation of Bayes Theorem for computational problems
\[ P(B \cap A) = P(B|A)P(A) \\ And \\ P(B) = P(B \cap A) + P(B \cap \bar{A}) \]
Where, \(\bar{A} = not\ A\), and the marginal distribution, \(P(B)\), can be written:
\[ P(B) = P(B|A)P(A) + P(B|\bar{A})P(\bar{A}) \]
Using the foregoing relations we can rewrite Bayes Theorem as:
\[ P(A|B) = \frac{P(A)P(B|A)}{P(B|A)P(A) + P(B|\bar{A})P(\bar{A})} \]
Rewrite Bayes Theorem as:
\[P(A|B) = k \cdot P(B|A)P(A)\]
Ignoring the normalization constant \(k\):
\[P(A|B) \propto P(B|A)P(A)\]
Denominator must account for all possible outcomes, or alternative hypotheses, \(h'\):
\[Posterior(hypothesis\ |\ evidence) =\\ \frac{Likelihood(evidence\ |\ hypothesis)\ prior(hypothesis)}{\sum_{ h' \in\ All\ possible\ hypotheses}Likelihood(evidence\ |\ h')\ prior(h')}\]
This is a formidable problem!
Hemophilia is a serious genetic condition carried on the X chromosome; a son who inherits an affected X chromosome expresses the disease
As evidence the woman has two sons (not identical twins) with no expression of hemophilia
What is the likelihood for the two sons \(X = (x_1,x_2)\) not having hemophilia?
Two possible cases here
\[ p(x_1=0, x_2=0 | \theta = 1) = 0.5 * 0.5 = 0.25 \\ p(x_1=0, x_2=0 | \theta = 0) = 1.0 * 1.0 = 1.0 \]
Note: we are neglecting the possibility of a mutation in one of the sons
Use Bayes theorem, with prior \(p(\theta=1) = 0.5\), to compute the probability the woman carries an X chromosome with hemophilia expression, \(\theta = 1\)
\[ p(\theta=1 | X) = \frac{p(X|\theta=1) p(\theta=1)}{p(X|\theta=1) p(\theta=1) + p(X|\theta=0) p(\theta=0)} \\ = \frac{0.25 * 0.5}{0.25 * 0.5 + 1.0 * 0.5} = 0.20 \]
The evidence of two sons without hemophilia causes us to revise the probability that the woman is a carrier downward, from 0.5 to 0.2
Note: The denominator is the sum over all possible hypotheses, the marginal distribution of the observations \(\mathbf{X}\)
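The arithmetic above can be transcribed directly:

```python
# Prior: probability the woman is a carrier (theta = 1)
prior_carrier = 0.5

# Likelihood of two unaffected sons under each hypothesis
lik_carrier = 0.5 * 0.5   # each son avoids the affected X with probability 0.5
lik_not_carrier = 1.0     # with no affected X, both sons are surely unaffected

# Bayes theorem; the denominator sums over both hypotheses
posterior_carrier = (lik_carrier * prior_carrier) / (
    lik_carrier * prior_carrier + lik_not_carrier * (1 - prior_carrier)
)
print(posterior_carrier)  # 0.2
```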
How do we interpret the foregoing relationship?
Rewrite Bayes Theorem as:
\[P(A|B) = k \cdot P(B|A)P(A)\]
Ignoring the normalization constant \(k\):
\[P(A|B) \propto P(B|A)P(A)\]
\[Posterior\ Distribution \propto Likelihood \bullet Prior\ Distribution \\ Or\\ P(parameters\ |\ data) \propto P(data\ |\ parameters)\ P(parameters) \]
We can find an unnormalized function proportional to the posterior distribution
Sum over the function to find the marginal distribution \(P(B)\)
This approach can transform an intractable computation into a simple summation
The goal of a Bayesian analysis is computing and performing inference on the posterior distribution of the model parameters
The general steps are as follows:
Identify data relevant to the research question
Define a sampling plan for the data. Data need not be collected in a single batch
Define the model and the likelihood function; e.g. regression model with Normal likelihood
Specify a prior distribution of the model parameters
Use the Bayesian inference formula to compute posterior distribution of the model parameters
Update the posterior as data is observed
Inference on the posterior can be performed; compute credible intervals
Optionally, simulate data values from realizations of the posterior distribution. These values are predictions from the model.
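The steps above can be sketched end-to-end with a simple grid approximation; the Bernoulli data and flat prior below are illustrative assumptions, not from the lecture:

```python
import numpy as np

# Step 3: the model is Bernoulli with unknown rate theta (illustrative data)
data = np.array([1, 0, 0, 1, 0, 0, 0, 1, 0, 0])

# Step 4: a prior over a grid of theta values (flat here, for simplicity)
grid = np.linspace(0.001, 0.999, 999)
prior = np.ones_like(grid) / grid.size

# Step 5: posterior proportional to likelihood times prior, normalized by summation
z, n = data.sum(), data.size
likelihood = grid**z * (1 - grid)**(n - z)
posterior = likelihood * prior
posterior /= posterior.sum()

# Step 7: inference, the MAP estimate and an equal-tailed 90% credible interval
map_est = grid[np.argmax(posterior)]
cdf = np.cumsum(posterior)
ci_low = grid[np.searchsorted(cdf, 0.05)]
ci_high = grid[np.searchsorted(cdf, 0.95)]
print(map_est, (ci_low, ci_high))
```

With a flat prior the MAP estimate coincides with the maximum likelihood estimate, \(z/n = 0.3\).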
An advantage of Bayesian models is that they can be updated as new observations are made
In contrast, for frequentist models data must be collected completely in advance
We update our belief by adding new evidence
The posterior of a Bayesian model with no evidence is the prior
The previous posterior serves as a prior for model updates
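A minimal sketch of sequential updating for a Beta-Bernoulli model (the batch counts are illustrative): each posterior becomes the prior for the next batch, and the result matches a single update on the pooled data:

```python
# Start from a Beta(1, 1) prior; with no evidence the posterior IS the prior
a, b = 1.0, 1.0

# Two batches of Bernoulli evidence arriving at different times
batches = [(3, 7), (5, 5)]          # (successes, failures) per batch

for successes, failures in batches:
    # The previous posterior serves as the prior for this update
    a, b = a + successes, b + failures

# Identical to one update with all the pooled data: Beta(1 + 8, 1 + 12)
print(a, b)  # 9.0 13.0
```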
The choice of the prior is a difficult, and potentially vexing, problem when performing Bayesian analysis
The need to choose a prior has often been cited as a reason why Bayesian models are impractical
The general guidance, that a prior must be convincing to a skeptical audience, is vague in practice
Some possible approaches include:
Use prior empirical information - empirical Bayes
Apply domain knowledge to determine a reasonable distribution
If there is poor prior knowledge for the problem a non-informative prior can be used
How can we use prior empirical information to estimate the parameters of the prior distribution?
An analytically and computationally simple choice for a prior distribution family is a conjugate prior
When a likelihood function is multiplied by its conjugate distribution the posterior distribution will be in the same family as the conjugate prior
Attractive idea for cases where the conjugate distribution exists
But there are many practical cases where a conjugate prior is not used
Most commonly used distributions have conjugates, with a few examples:
| Likelihood | Conjugate |
|---|---|
| Binomial | Beta |
| Bernoulli | Beta |
| Poisson | Gamma |
| Categorical | Dirichlet |
| Normal - mean | Normal |
| Normal - variance | Inverse Gamma |
| Normal - inverse variance, \(\tau\) | Gamma |
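As a sketch of two rows of this table (the prior parameters and counts are illustrative), conjugacy reduces the Bayesian update to simple parameter arithmetic:

```python
import numpy as np
from scipy.stats import beta, gamma

# Binomial likelihood with Beta(a, b) prior -> posterior Beta(a + z, b + n - z)
a, b = 2.0, 2.0
z, n = 13, 20                       # 13 successes in 20 trials
post_beta = beta(a + z, b + n - z)  # Beta(15, 9)

# Poisson likelihood with Gamma(shape, rate) prior -> the shape gains the
# total event count and the rate gains the number of observations
shape, rate = 3.0, 1.0
counts = np.array([2, 4, 3, 5])
post_gamma = gamma(shape + counts.sum(), scale=1.0 / (rate + counts.size))

print(post_beta.mean(), post_gamma.mean())
```

In both cases the posterior stays in the prior's family, so no numerical integration is needed.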
We are interested in analyzing the incidence of distracted drivers
Randomly sample the behavior of 10 drivers at an intersection and determine if they exhibit distracted driving or not
Data are Binomially distributed, a driver is distracted or not, with likelihood:
\[ P(k) = \binom{n}{k} \cdot \theta^k(1-\theta)^{n-k}\]
The conjugate prior for this Binomial process is the Beta distribution
What are the properties of the Beta distribution?
Beta distribution for different parameter values
Consider the product of a Binomial likelihood and a Beta prior
Define the evidence as \(n\) trials with \(z\) successes
Prior is a Beta distribution with parameters \(a\) and \(b\), or the vector \(\theta = (a,b)\)
From Bayes Theorem the distribution of the posterior:
\[\begin{align} posterior(\theta | z, n) &= \frac{likelihood(z,n | \theta)\ prior(\theta)}{data\ distribution (z,n)} \\ p(\theta | z, n) &= \frac{Binomial(z,n | \theta)\ Beta(\theta)}{p(z,n)} \\ &= Beta(z + a,\ n-z+b) \end{align}\]
There are some useful insights you can gain from this relationship:
\[ posterior(\theta | z, n) = Beta(z + a,\ n-z+b) \]
Posterior distribution is in the Beta family, as a result of conjugacy
Parameters \(a\) and \(b\) are determined by the prior and the evidence
Parameters of the prior can be interpreted as pseudo counts of successes, \(a = pseudo\ success + 1\) and failures, \(b = pseudo\ failure + 1\)
- Evidence is also in the form of (actual) counts of successes, \(z\), and failures, \(n-z\)
- The more evidence the greater the influence on the posterior distribution
- Large amount of evidence will overwhelm the prior
- With large amount of evidence, posterior converges to frequentist model
Consider example with:
- Prior pseudo counts \([1,9]\), successes \(a = 1 + 1\) and failures, \(b = 9 + 1\)
- Evidence, successes \(= 2\) and failures, \(= 8\)
- Posterior is \(Beta(2 + 2,\ 8 + 10) = Beta(4,\ 18)\)
## Maximum of the prior density = 0.100
## Maximum likelihood 0.200
## MAP = 0.150
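The printed values are consistent with the conjugate update, which can be sketched as follows (the plotting code that accompanied this output is omitted):

```python
# Prior Beta(a, b) from pseudo counts [1, 9]: a = 1 + 1, b = 9 + 1
a, b = 2, 10
z, n = 2, 10                                    # evidence: 2 successes in 10 trials

prior_mode = (a - 1) / (a + b - 2)              # maximum of the prior density: 0.1
ml = z / n                                      # maximum likelihood estimate: 0.2
post_a, post_b = a + z, b + (n - z)             # conjugate update -> Beta(4, 18)
map_est = (post_a - 1) / (post_a + post_b - 2)  # posterior mode (MAP): 0.15

print(prior_mode, ml, map_est)  # 0.1 0.2 0.15
```

The MAP falls between the prior mode and the maximum likelihood estimate, pulled toward each by their relative weights of evidence.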
Adding additional evidence, same prior:
- Prior pseudo counts \([1,9]\), successes \(a = 1 + 1\) and failures, \(b = 9 + 1\)
- Evidence, successes \(= 10\) and failures, \(= 30\)
- Posterior is \(Beta(2 + 10,\ 10 + 40 - 10) = Beta(12,\ 40)\)
## Maximum of the prior density = 0.100
## Maximum likelihood 0.250
## MAP = 0.220
How can we specify the uncertainty for a Bayesian parameter estimate?
Example, the \(\alpha = 0.90\) credible interval encompasses the 90% of the posterior distribution with the highest density
The credible interval is sometime called the highest density interval (HDI), or highest posterior density interval (HPDI)
For symmetric distributions the credible interval can be numerically the same as the confidence interval
What are the 95% credible intervals for \(Beta(12,\ 40)\)?
Probability of distracted drivers for the next 10 cars
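One way to compute such an interval with `scipy` (this is the equal-tailed interval; the HDI of a skewed Beta density differs slightly, but is close here):

```python
from scipy.stats import beta

# Posterior for the distracted-driving rate from the conjugate Beta update
posterior = beta(12, 40)

# Equal-tailed 95% credible interval: the central 95% of posterior probability
lower, upper = posterior.interval(0.95)
print(round(lower, 3), round(upper, 3))
```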
What else can we do with a Bayesian posterior distribution beyond credible intervals?
Perform simulations and make predictions
Predictions are computed by simulating from the posterior distribution
Results of these simulations are useful for several purposes, including:
Example: What are the probabilities of distracted drivers for the next 10 cars with posterior, \(Beta(12,\ 40)\)?
Probability of distracted drivers for the next 10 cars
Intervals based on 1,000 trials
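A sketch of this simulation (the seed is an arbitrary choice): for each trial, draw a plausible distracted-driving rate from the posterior, then simulate the next 10 drivers at that rate:

```python
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(0)
n_sims, n_cars = 1000, 10

# Draw a plausible rate from the posterior for each simulated trial...
theta_draws = beta(12, 40).rvs(size=n_sims, random_state=rng)
# ...then count distracted drivers among the next 10 cars at that rate
predicted = rng.binomial(n_cars, theta_draws)

# Posterior predictive summary: mean count and a 90% prediction interval
print(predicted.mean(), np.percentile(predicted, [5, 95]))
```

Because each trial uses a fresh draw of the rate, the prediction intervals reflect both sampling variability and parameter uncertainty.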
Use the chain rule of probability to factor a joint distribution into hierarchy
\[P(A,B) = P(A|B)P(B)\]
\[P(A_1, A_2, A_3, A_4 \ldots, A_n) = P(A_1 | A_2, A_3, A_4, \ldots, A_n)\ P(A_2, A_3, A_4 \ldots, A_n)\]
\[P(A_1, A_2, A_3, A_4 \ldots, A_n) =\\ P(A_1 | A_2, A_3, A_4, \ldots, A_n)\ P(A_2 | A_3, A_4 \ldots, A_n)\\ P(A_3| A_4 \ldots, A_n) \ldots P(A_n)\]
The factorization is not unique.
Factor the variables in any order
For a joint distribution with \(n\) variables, there are \(n!\) unique factorizations
Example, we can factorize the foregoing distribution as:
\[P(A_1, A_2, A_3, A_4 \ldots, A_n) =\\ P(A_n | A_{n-1}, A_{n-2}, A_{n-3}, \ldots, A_1)\ P(A_{n-1}| A_{n-2}, A_{n-3}, \ldots, A_1)\\ P(A_{n-2}| A_{n-3}, \ldots, A_1) \ldots p(A_1)\]
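The chain rule can be verified numerically on a small discrete joint distribution (a randomly generated illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# A random joint distribution over three binary variables A1, A2, A3
joint = rng.random((2, 2, 2))
joint /= joint.sum()

# One factorization: P(A1, A2, A3) = P(A1 | A2, A3) P(A2 | A3) P(A3)
p_a3 = joint.sum(axis=(0, 1))          # marginal of A3
p_a2_a3 = joint.sum(axis=0)            # joint of (A2, A3)
p_a2_given_a3 = p_a2_a3 / p_a3         # P(A2 | A3)
p_a1_given_rest = joint / p_a2_a3      # P(A1 | A2, A3)

# Multiplying the factors back together recovers the joint exactly
reconstructed = p_a1_given_rest * p_a2_given_a3 * p_a3
assert np.allclose(reconstructed, joint)
```

Any of the \(3! = 6\) variable orderings gives an equally valid factorization.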
Roll a ball on a billiard table and mark its position on the table length-wise (across only one dimension)
Alice and Bob take turns rolling balls and look at where they land in relation to the first ball
Alice and Bob don’t know where the first ball landed
Denote the current score with \(D\) (as in data)
What is the probability of Bob winning the game?
- Ball needs to land on Bob’s side the next three rounds.
Let \(B\) be the event that Bob wins, \(D\) the data, or current score
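A sketch of the computation, assuming the classic version of this example: the first player to 6 points wins and the current score is Alice 5, Bob 3 (these specifics are assumptions; the slides only say Bob must win the next three rounds). With a uniform prior on the first ball's position \(p\), the posterior over \(p\) given the score is proportional to \(p^5(1-p)^3\), and Bob wins with probability \((1-p)^3\):

```python
from scipy.integrate import quad

# Posterior over p (probability a roll lands on Alice's side) given the
# assumed score of Alice 5, Bob 3, with a uniform prior on p:
#   posterior(p | D) proportional to p**5 * (1 - p)**3
# Bob wins only if the next three rolls all land on his side: (1 - p)**3
num, _ = quad(lambda p: p**5 * (1 - p)**3 * (1 - p)**3, 0, 1)
den, _ = quad(lambda p: p**5 * (1 - p)**3, 0, 1)

p_bob_wins = num / den
print(p_bob_wins)  # 1/11, approximately 0.0909
```

Averaging over the uncertainty in \(p\) gives \(1/11 \approx 0.091\), larger than the frequentist plug-in answer \((1 - 5/8)^3 = 27/512 \approx 0.053\).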
Bayesian analysis is a contrast to frequentist methods
The objective of Bayesian analysis is to compute a posterior distribution
This contrasts with frequentist statistics, which computes a point estimate and confidence interval from a sample
Bayesian models allow expressing prior information in the form of a prior distribution
Selection of prior distributions can be performed in a number of ways
The posterior distribution is said to quantify our current belief
We update beliefs based on additional data or evidence
This is a critical difference from frequentist models, which must be computed from a complete sample
Inference can be performed on the posterior distribution by finding the maximum a posteriori (MAP) value and a credible interval
Predictions are made by simulating from the posterior distribution